This data set contains information about Netflix shows and movies, including genre, runtime, reviews from different sources, age certification, and release year. The goal of this project is to predict a title's imdb_score from these factors. Netflix is an online streaming platform that lets users watch any of its shows or movies on demand at any time. There are thousands of titles on the platform, and each has an accompanying IMDB score from IMDB (the Internet Movie Database), a separate platform that collects ratings, reviews, votes cast by users, and a popularity score for each title. This project uses machine learning to determine whether I can build a model that accurately predicts the IMDB score of a film or show available on Netflix.
My family loves movies, my sister especially, so I knew I wanted to build a project that predicts the IMDB score of a movie to see how much a movie's score really relies on performance rather than themes or popularity. Certain kinds of movies, like superhero movies or kids' movies, seem to consistently do poorly in theaters, so I was interested to see what actually makes a good movie or show. Netflix has also been my favorite streaming platform for a long time now, so I wanted to pull data from this platform. Sometimes I hear that a movie is “so good!” and then when I watch it I think it is not great and doesn't deserve the hype. I am going to see whether the themes or the popularity of a show actually make a difference when it comes to its IMDB score.
# Assigning the data to a variable
netflix <- read.csv(file = '/Users/mackenzie/Documents/pstat 131/netflix data/titles.csv')
I obtained this data set from Kaggle, where it was uploaded by user Victor Soeiro. It lists all shows and movies available for streaming on Netflix in the United States, and was acquired in July 2022.
https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies?select=titles.csv
Let’s look at our data and see what we’re working with!
# Calling head() to see the first few rows
head(netflix)
## id title type
## 1 ts300399 Five Came Back: The Reference Films SHOW
## 2 tm84618 Taxi Driver MOVIE
## 3 tm154986 Deliverance MOVIE
## 4 tm127384 Monty Python and the Holy Grail MOVIE
## 5 tm120801 The Dirty Dozen MOVIE
## 6 ts22164 Monty Python's Flying Circus SHOW
## description
## 1 This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action.
## 3 Intent on seeing the Cahulawassee River before it's turned into one huge lake, outdoor fanatic Lewis Medlock takes his friends on a river-rafting trip they'll never forget into the dangerous American back-country.
## 4 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 5 12 American military prisoners in World War II are ordered to infiltrate a well-guarded enemy château and kill the Nazi officers vacationing there. The soldiers, most of whom are facing death sentences for a variety of violent crimes, agree to the mission and the possible commuting of their sentences.
## 6 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## release_year age_certification runtime
## 1 1945 TV-MA 51
## 2 1976 R 114
## 3 1972 R 109
## 4 1975 PG 91
## 5 1967 150
## 6 1969 TV-14 30
## genres production_countries seasons
## 1 ['documentation'] ['US'] 1
## 2 ['drama', 'crime'] ['US'] NA
## 3 ['drama', 'action', 'thriller', 'european'] ['US'] NA
## 4 ['fantasy', 'action', 'comedy'] ['GB'] NA
## 5 ['war', 'action'] ['GB', 'US'] NA
## 6 ['comedy', 'european'] ['GB'] 4
## imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score
## 1 NA NA 0.600 NA
## 2 tt0075314 8.2 808582 40.965 8.179
## 3 tt0068473 7.7 107673 10.010 7.300
## 4 tt0071853 8.2 534486 15.461 7.811
## 5 tt0061578 7.7 72662 20.398 7.600
## 6 tt0063929 8.8 73424 17.617 8.306
Most of these columns will not be important for fitting a model, so I will drop them now. The id and imdb_id columns are unique identifiers that won't help prediction, and the same is true of title, so I will drop all three. I will also drop description, age_certification, and production_countries, since those values will not be helpful for analysis. My goal is to predict imdb_score, so I will drop tmdb_score as well: the two scores are very similar, and I don't want my model to predict imdb_score directly from tmdb_score. Finally, I will drop seasons, since most of its values are missing.
The genres variable contains a list of values for each title, which is hard to use directly in our calculations. Ideally there would be only one value per observation, and since the genres are listed with the most descriptive genre first, we will keep only the first genre in each list.
# Splitting genres into just the first element in the list
netflix$genres <- gsub("\\[|\\]|'", "", netflix$genres)
netflix$genres <- sapply(strsplit(netflix$genres, ","), function(x) trimws(x[1]))
# Removing variables that won't be helpful for analysis
netflix <- netflix %>%
dplyr::select(-title, -id, -imdb_id, -description, -age_certification, -production_countries, -tmdb_score, -seasons) %>%
# Turning categorical variables into factors
mutate(type = factor(type), genres = factor(genres))
I will check the size of the data to see whether I should cut any of it out, since a data set this large may be unwieldy to work with.
# Calling dim() to see dimensions of Netflix data
dim(netflix)
## [1] 5850 7
There are 5,850 observations and 7 variables (5,850 rows and 7 columns) in this data set. The number of observations seems very high for our analysis, so I am going to check whether any observations should be eliminated due to missing values. Once I check for missing values, I may still have to cut down the number of observations in order to continue my analysis.
# See if theres any missing data
sum(is.na(netflix))
## [1] 1130
There are a lot of missing values in this dataset, so let’s see where those values lie and what I can do to eliminate these values.
# Calling summary() function to see where there are missing values
netflix %>%
summary()
## type release_year runtime genres
## MOVIE:3744 Min. :1945 Min. : 0.00 drama :1421
## SHOW :2106 1st Qu.:2016 1st Qu.: 44.00 comedy :1305
## Median :2018 Median : 83.00 documentation: 665
## Mean :2016 Mean : 76.89 thriller : 377
## 3rd Qu.:2020 3rd Qu.:104.00 action : 365
## Max. :2022 Max. :240.00 (Other) :1658
## NA's : 59
## imdb_score imdb_votes tmdb_popularity
## Min. :1.500 Min. : 5.0 Min. : 0.0094
## 1st Qu.:5.800 1st Qu.: 516.8 1st Qu.: 2.7285
## Median :6.600 Median : 2233.5 Median : 6.8210
## Mean :6.511 Mean : 23439.4 Mean : 22.6379
## 3rd Qu.:7.300 3rd Qu.: 9494.0 3rd Qu.: 16.5900
## Max. :9.600 Max. :2294231.0 Max. :2274.0440
## NA's :482 NA's :498 NA's :91
Since there are 5,850 observations in this data set, we can afford to remove the observations that do not contain all the information needed for the analysis. There are missing values in genres, imdb_score, imdb_votes, and tmdb_popularity, so I will remove every observation with at least one missing value.
# Removing observations with missing values
netflix <- netflix %>%
na.omit()
# Seeing the new dimensions of Netflix
dim(netflix)
## [1] 5277 7
Now there are no more missing values in the data set. There are still 5,277 observations, which is more than I need, so I will cut the data down to a more manageable size by keeping only the first 300 observations. I also conducted my analysis with the full data set, and its correlation matrix showed little correlation between variables, so I decided to use a smaller subset in order to see clearer correlations.
# Trimming netflix to 300 observations
netflix <- netflix[1:300,]
dim(netflix)
## [1] 300 7
# Correlation matrix of the numeric variables
netflix %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = "lower", diag = FALSE)
There is a positive correlation between imdb_votes and imdb_score, which suggests that the number of votes is a useful indicator of the score. There is also a positive correlation between tmdb_popularity and imdb_votes: the more popular a show or movie, the more votes it receives. There is a negative correlation between runtime and both popularity and IMDB score, meaning that as runtime increases, the popularity and rating of the show/movie tend to decrease.
Since there is a slight positive correlation between tmdb_popularity and imdb_votes, I will make a graph comparing those two variables.
netflix %>%
ggplot(aes(x= tmdb_popularity, y = imdb_votes)) +
geom_point() +
geom_smooth(method= 'lm', se = FALSE) +
labs(x = 'Popularity Score', y = 'Total Number of Votes', title = 'Comparison of Popularity and Votes') +
theme_minimal()
Some of the movies are much more popular than others, so it is hard to see the pattern between the points. I have fitted a regression line in order to see the relationship more clearly. As the correlation plot predicted, imdb_votes increases as tmdb_popularity increases. This is important because it indicates that popularity and votes may be important factors in determining IMDB score.
I will now explore the average IMDB score for each genre in the data set, since there may be a relationship between genre and score.
# Make a frame of average imdb_score and associating genre
avgscr_genre <- netflix %>%
group_by(genres) %>%
summarise(average = mean(imdb_score, na.rm = TRUE))
# Plot Average imdb_score by genre
avgscr_genre %>%
ggplot(aes(x = genres, y = average, fill = genres)) +
coord_flip() +
geom_bar(stat = 'identity') +
scale_fill_discrete() +
labs(y = 'IMDB Score', x = 'Genres', title = 'Average IMDB Score By Genre')
Note: Colors Added for Aesthetic Purposes Only
As we can see from the graph, the average IMDB score does not vary dramatically across genres, though there is a noticeable gap between the highest-rated genre on average (animation) and the lowest (horror). This graph tells us there are differences in IMDB score by genre, but it doesn't provide strong evidence that genre is a deciding factor in IMDB score.
I will split the data into about 70% training and 30% testing while also stratifying on the outcome variable, imdb_score.
# Split the data and stratify on imdb_score
netflix_split <- initial_split(netflix, strata = imdb_score, prop = 0.7)
net_train <- training(netflix_split)
net_test <- testing(netflix_split)
We will verify that we split the data correctly.
# See the proportions of the training and testing sets
nrow(net_train)/nrow(netflix)
## [1] 0.6933333
nrow(net_test)/nrow(netflix)
## [1] 0.3066667
The training set has 69.3% of the data and the testing set has 30.7%, which is very close to the 70/30 split we wanted, so the data was split correctly into training and testing sets. We also stratified the split on the outcome variable, imdb_score, by including strata = imdb_score.
In order to continue our analysis, we need to set up a recipe that will be used for each model. We can modify the recipe later if needed, since some models have different requirements: the elastic net model, for example, requires the data to be normalized before fitting. We set up our recipe using net_train, the training data created above. The outcome variable imdb_score is predicted by type, genres, release_year, runtime, imdb_votes, and tmdb_popularity. I have also included step_center() and step_scale() to normalize the predictors, since most models require the data be normalized before analysis.
We have two categorical variables in this data set, genres and type. Since categorical variables are not numeric, we must convert them before modeling. To do this I will dummy code them: for each level of a categorical variable (except a reference level), dummy coding creates a separate 0/1 indicator column that equals 1 when the observation has that level and 0 otherwise.
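As a quick illustration of what dummy coding produces, here is a toy sketch on made-up data (not part of the analysis pipeline); the genre levels and scores are invented for the example.

```r
# Toy illustration of dummy coding with recipes (illustrative data only)
library(recipes)

toy <- data.frame(
  genres = factor(c("drama", "comedy", "horror")),
  score  = c(7.1, 6.4, 5.2)
)

recipe(score ~ genres, data = toy) %>%
  step_dummy(genres) %>%
  prep() %>%
  bake(new_data = NULL)
# Produces one 0/1 indicator column per non-reference level,
# here genres_drama and genres_horror (comedy is the baseline)
```

The same expansion happens inside our recipe for every level of type and genres in the training data.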
I have also included a prep() and bake() step in order to see the recipe I created.
# Creating recipe
netflix_recipe <- recipe(imdb_score ~ ., data = net_train) %>%
# Dummy coding the categorical predictors - type and genres
step_dummy(all_nominal_predictors()) %>%
# Removing variables with zero variance
step_zv(all_predictors()) %>%
# Normalizing
step_center(all_predictors()) %>%
step_scale(all_predictors())
# Let us prep and bake the recipe!
netflix_recipe %>%
prep() %>%
bake(new_data = net_train) %>%
head()
## # A tibble: 6 × 21
## release_year runtime imdb_votes tmdb_popularity imdb_score type_SHOW
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.28 1.40 -0.499 -0.429 2.1 -0.620
## 2 -1.36 1.09 -0.497 -0.450 6.2 -0.620
## 3 -3.03 1.34 -0.499 -0.477 6.4 -0.620
## 4 -1.72 0.336 -0.499 -0.474 6.4 -0.620
## 5 -1.45 0.983 -0.499 -0.472 4.4 -0.620
## 6 -1.19 0.462 -0.469 -0.198 4.9 -0.620
## # ℹ 15 more variables: genres_animation <dbl>, genres_comedy <dbl>,
## # genres_crime <dbl>, genres_documentation <dbl>, genres_drama <dbl>,
## # genres_family <dbl>, genres_fantasy <dbl>, genres_history <dbl>,
## # genres_horror <dbl>, genres_reality <dbl>, genres_romance <dbl>,
## # genres_scifi <dbl>, genres_thriller <dbl>, genres_war <dbl>,
## # genres_western <dbl>
# Creating folds and stratifying on the outcome variable
netflix_folds <- vfold_cv(net_train, v = 10, strata = imdb_score)
I am utilizing k-fold cross-validation in order to get more stable, less biased estimates of each model's performance on the training data. Instead of a single estimate from one train/test split, k-fold cross-validation averages performance across multiple train/assess splits carved out of the training data. I also stratify the folds on the outcome variable imdb_score, since I don't want the distribution of scores to become unbalanced across the folds.
Now that we have tidied the data, explored it, and set up our recipe, we can build our models! The metric I have chosen to evaluate each model is the Root Mean Squared Error (RMSE). Since we are using regression to predict imdb_score, RMSE is a natural choice: it is computed consistently across all the regression models we will fit, so it is the easiest metric to use to determine which model fits the training data best.
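For concreteness, RMSE is the square root of the mean squared difference between actual and predicted values. A small sketch with made-up predictions (illustrative values only, not drawn from the models below):

```r
# RMSE by hand: sqrt of the mean squared prediction error
actual    <- c(7.2, 6.5, 8.0, 5.9)   # hypothetical imdb_score values
predicted <- c(7.0, 6.9, 7.6, 6.1)   # hypothetical model predictions

sqrt(mean((actual - predicted)^2))
# sqrt(mean(c(0.2, -0.4, 0.4, -0.2)^2)) = sqrt(0.1), about 0.316
```

yardstick's rmse_vec() computes the same quantity, and it is what tune_grid() reports for each resample; a lower RMSE means predictions are closer to the true scores on average.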
All of the models are built following the same pattern, so I will organize this section by each step needed to fit them.
# Linear Regression
lm_model <- linear_reg() %>%
set_mode('regression') %>%
set_engine('lm')
# K- Nearest Neighbors
# tuning the number of neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_mode('regression') %>%
set_engine('kknn')
# Ridge Regression
# tuning penalty and setting mixture = 0 to get ridge
rd_model <- linear_reg(mixture = 0, penalty = tune()) %>%
set_mode('regression') %>%
set_engine('glmnet')
# Elastic Net Regression
# tuning penalty and mixture
en_model <- linear_reg(penalty = tune(), mixture = tune()) %>%
set_mode('regression') %>%
set_engine('glmnet')
# Lasso Regression
# tuning penalty and setting mixture to 1 to get lasso
ls_model <- linear_reg(mixture = 1, penalty = tune()) %>%
set_mode('regression') %>%
set_engine('glmnet')
# Random Forest
# tuning mtry (number of predictors sampled at each split), trees (total number
# of trees) and min_n (minimum number of observations in a node required to split it)
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_mode('regression') %>%
set_engine('ranger', importance = 'impurity')
# Boosted Trees
# tuning trees, learn_rate, and min_n
bt_model <- boost_tree(trees = tune(),
learn_rate = tune(),
min_n = tune()) %>%
set_mode('regression') %>%
set_engine('xgboost')
For each model we set up a workflow that includes both model and recipe.
# Linear Regression
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(netflix_recipe)
# K-Nearest Neighbors
knn_wflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(netflix_recipe)
# Ridge Regression
ridge_wflow <- workflow() %>%
add_model(rd_model) %>%
add_recipe(netflix_recipe)
# Elastic Net Regression
elastic_wflow <- workflow() %>%
add_model(en_model) %>%
add_recipe(netflix_recipe)
# Lasso Regression
lasso_wflow <- workflow() %>%
add_model(ls_model) %>%
add_recipe(netflix_recipe)
# Random Forest
randfor_wflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(netflix_recipe)
# Boosted Trees
boost_wflow <- workflow() %>%
add_model(bt_model) %>%
add_recipe(netflix_recipe)
For each model being tuned, we must specify a range for each hyperparameter. We do this by creating a tuning grid and manually entering ranges for each parameter.
# Linear Regression
# No parameters that need tuning - no grid
# K-Nearest Neighbors
knn_grid <- grid_regular(neighbors(range = c(1,10)), levels = 5)
# Ridge Regression/Lasso Regression
rdls_grid <- grid_regular(penalty(range = c(0,1)), levels = 50)
# Elastic Net Regression
en_grid <- grid_regular(penalty(), mixture(range = c(0,1)), levels = 10)
# Random Forest
rf_grid <- grid_regular(mtry(range = c(1, 6)), trees(range = c(50,500)), min_n(range = c(5,20)), levels = 10)
# Boosted Trees
bs_grid <- grid_regular(trees(range = c(50, 200)), learn_rate(range = c(0.01,0.1), trans = identity_trans()), min_n(range = c(40, 60)), levels = 5)
Using the models, workflows, k-fold cross-validation folds, and tuning grids, we can now tune the hyperparameters with tune_grid().
# Linear Regression
# Doesn't need to be tuned
# K-Nearest Neighbors
knn_tune <- tune_grid(knn_wflow,
resamples = netflix_folds,
grid = knn_grid)
# Ridge Regression
rd_tune <- tune_grid(ridge_wflow,
resamples = netflix_folds,
grid = rdls_grid)
# Elastic Net Regression
en_tune <- tune_grid(elastic_wflow,
resamples = netflix_folds,
grid = en_grid)
# Lasso Regression
ls_tune <- tune_grid(lasso_wflow,
resamples = netflix_folds,
grid = rdls_grid)
# Random Forest
rf_tune <- tune_grid(randfor_wflow,
resamples = netflix_folds,
grid = rf_grid)
# Boosted Trees
bs_tune <- tune_grid(boost_wflow,
resamples = netflix_folds,
grid = bs_grid)
The models take a long time to run, so we save each tuning result to its own RDS file; whenever we need them, we can load them back into the project with the read_rds() function.
# Linear Regression
# no tuning grid to save
# K-Nearest Neighbors
write_rds(knn_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/knn_tune.rds')
# Ridge Regression
write_rds(rd_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/rd_tune.rds')
# Elastic Net Regression
write_rds(en_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/en_tune.rds')
# Lasso Regression
write_rds(ls_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/ls_tune.rds')
# Random Forest
write_rds(rf_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/rf_tune.rds')
# Boosted Trees
write_rds(bs_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/bs_tune.rds')
Loading in the saved RDS files with read_rds().
# Linear Regression
# no model saved to RDS file
# K-Nearest Neighbors
knn_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/knn_tune.rds')
# Ridge Regression
rd_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/rd_tune.rds')
# Elastic Net Regression
en_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/en_tune.rds')
# Lasso Regression
ls_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/ls_tune.rds')
# Random Forest
rf_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/rf_tune.rds')
# Boosted Trees
bs_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/bs_tune.rds')
Finally, for each model we select the hyperparameter combination with the lowest cross-validated RMSE, using the one-standard-error rule for the tuned models.
# Linear Regression
lm_fit <- fit_resamples(lm_wflow, resamples = netflix_folds)
lm_results <- show_best(lm_fit, metric = 'rmse')
# K-Nearest Neighbors
best_neighbor <- select_by_one_std_err(knn_tuned, desc(neighbors), metric = "rmse")
# Ridge Regression
best_ridge <- select_by_one_std_err(rd_tuned, penalty, metric = "rmse")
# Elastic Net Regression
best_elastic <- select_by_one_std_err(en_tuned, penalty, mixture, metric = "rmse")
# Lasso Regression
best_lasso <- select_by_one_std_err(ls_tuned, penalty, metric = "rmse")
# Random Forest
best_randfor <- select_by_one_std_err(rf_tuned, mtry, trees, min_n, metric = "rmse")
# Boosted Trees
best_boost <- select_by_one_std_err(bs_tuned, trees, learn_rate, min_n, metric = "rmse")